Mapping the geographical diffusion of new words 



Language in social media is rich with linguistic innovations, most strikingly in
the new words and spellings that constantly enter the lexicon. Despite assertions
about the power of social media to connect people across the world, we find that
many of these neologisms are restricted to geographically compact areas. Even
words that become ubiquitous often grow in popularity geographically,
spreading from city to city. Thus, social media text offers a unique opportunity to
study the diffusion of lexical change. In this paper, we show how an autoregressive
model of word frequencies in social media can be used to induce a network
of linguistic influence between American cities. By comparing the induced network
with the geographical and demographic characteristics of each city, we can
measure the factors that drive the spread of lexical innovation.

1 Introduction 

As social communication is increasingly conducted through written language on computers and
cellphones, writing must evolve to meet the needs of this less formal genre. Computer-mediated
communication (CMC) has seen a wide range of linguistic innovation, including emoticons,
abbreviations, expressive orthography such as lengthening [1], and entirely new words [2, 3, 4]. While
these developments have been celebrated by some [5] and lamented by many others, they offer an
intriguing new window on how language can change on the lexical level.

In principle, social media can connect people across the world, but in practice social media
connections tend to be quite local [6, 7]. Many social media neologisms are geographically local as well:
our prior work has identified dozens of terms that are used only within narrow geographical
areas [2]. Some of these terms were previously known from spoken language (for example, hella [8]),
and others may refer to phonological language differences (the spelling suttin for something). But
there are many terms that seem disconnected from spoken language variation: for example,
differences that are only apparent in the written form, such as the spelling uu for you. This suggests
that some geographically-specific terms have evolved through computer-mediated communication.

To study this evolution, we have gathered a large corpus of geotagged Twitter messages over the
course of nearly two years, a time period which includes the introduction and spread of a number of
new terms. Three examples are shown in Figure 1.

• The first term, bruh, is an alternative spelling of bro, short for brother. At the beginning 
of our sample it is popular in a few southeastern cities; it then spreads throughout the 
southeast, and finally to the west coast. 

• The middle term, af, is an abbreviation for as fuck, as in I'm bored af. It is initially found
in southern California, then jumps to the southeast, before gaining widespread popularity.

• The third term is an emoticon. It is initially used in several east coast cities, then
spreads to the west coast, and finally gains widespread popularity.

Figure 1: Change in popularity for three words: bruh, af, and an emoticon. Blue circles indicate cities
in which the probability of an author using the word in a week in the listed period is greater than
1% (2.5% for bruh).



Clearly geography plays some role in each example — most notably in bruh, which may reference 
southeastern phonology. But each example also features long range "jumps" over huge parts of the 
country. What explains, say, the movement of af from southern California all the way to a cluster of 
cities around Atlanta? 

This paper presents a quantitative model of the geographical spread of lexical innovations through 
social media. Using a novel corpus of millions of Twitter messages, we are able to trace the changing 
popularity of thousands of words. We build an autoregressive model that captures aggregate patterns 
of language change across 200 metropolitan areas in the United States. The coefficients of this model 
correspond to sender/receiver pairs of cities that are especially likely to transmit linguistic influence. 

After inducing a network of cities linked by linguistic influence, we search for the underlying factors 
that explain the network's structure. We show that while geographical proximity plays a strong role 
in shaping the network of linguistic influence, demographic homophily is at least equally important. 
Going beyond homophily, we identify asymmetric features that make individual cities more likely 
to send or receive lexical innovations. 



2 Related work 

This paper draws on several streams of related work, including sociolinguistics, theoretical models 
of language change, and network induction. 

2.1 Sociolinguistic models of language change 

Language change has been a central concern of sociolinguistics for several decades. This tradition 
includes a range of theoretical accounts of language change that provide a foundation for our work. 

The wave model argues that linguistic innovations are spread through interactions over the course 
of an individual's life, and that the movement of a linguistic innovation from one region to another 
depends on the density of interactions [9]. The simplest version of this theory supposes that the
probability of contact between two individuals depends on their distance, so linguistic innovations 
should diffuse continuously through space. 

The gravity model refines this view, arguing that the likelihood of contact between individuals from
two cities depends on the size of the cities as well as their distance; thus, linguistic innovations
should be expected to travel between large cities first [10]. Labov argues for a cascade model in
which many linguistic innovations travel from the largest city to the next largest, and so on [11].
However, Nerbonne and Heeringa present a quantitative study of dialect differences in the
Netherlands, finding little evidence that population size impacts diffusion of pronunciation differences [12].

Cultural factors surely play an important role in both the diffusion of, and resistance to, language
change. Many words and phrases have entered the standard English lexicon from dialects associated
with minority groups: for example, cool, rip off, and uptight all derive from African American
English [13]. Conversely, Gordon finds evidence that minority groups in the US resist regional sound
changes associated with White speakers [14]. Political boundaries also come into play: Boberg
demonstrates that the US-Canada border separating Detroit and Windsor complicates the spread of
sound changes [15]. While our research does not distinguish the language choices of speakers within
a given area, gender and local social networks can have a profound effect on the adoption of sound
changes associated with a nearby city [16].

Methodologically, traditional sociolinguistics usually focuses on a small number of carefully chosen 
linguistic variables, particularly phonology [10, 11]. Because such changes typically take place over
decades, change is usually measured in apparent time, by comparing language patterns across age 
groups. Social media data lend a complementary perspective, allowing the study of language change 
by aggregating across thousands of lexical variables, without manual variable selection. (Nerbonne 
has shown how to build dialect maps while aggregating across regional pronunciation differences 
of many words [17], but this approach requires identifying matched words across each dialect.) 
Because language in social media changes rapidly, it is possible to observe change directly in real 
time, thus avoiding confounding factors associated with other social differences in the language of 
different age groups. 

2.2 Theoretical models of language change 

A number of abstract models have been proposed for language change, including dynamical
systems [18], Nash equilibria [19], Bayesian learners [20], agent-based simulations [21], and
population genetics [22], among others. In general, such research is concerned with demonstrating that
a proposed theoretical framework can account for observed phenomena, such as the geographical 
distribution of linguistic features and their rate of adoption over time. In contrast, this paper fits a 
relatively simple model to a large corpus of real data. The integration of more complex modeling 
machinery seems an interesting direction for future work. 

2.3 Identifying networks from temporal data 

From a technical perspective, this work can be seen as an instance of network induction. Prior work 
in this area has focused on inducing a network of connections from a cascade of "infection" events. 
For example, Gomez-Rodriguez et al. reconstruct networks based on cascades of events [23], such
as the use of a URL on a blog [24] or the appearance of a short textual phrase [25]. We share the
goal of reconstructing a latent network from observed data, but face an additional challenge in that 
our data do not contain discrete infection events. Instead we have changing word counts over time; 
our model must decide whether a change in word frequency is merely noise, or whether it reflects 
the spread of language change through an underlying network. 

3 Data 

This study is performed on a new dataset of social media text, gathered from the public
"Gardenhose" feed offered by the microblog site Twitter (approximately 10% of all public posts). The dataset
has been acquired by continuously pulling data from June 2009 to May 2011, resulting in a total of
494,865 individuals and roughly 44 million messages. In this study, we consider a subsample of 86
weeks, from December 2009 to May 2011.

We considered only messages that contain geolocation metadata, which is optionally included by
smartphone clients. We map the latitude and longitude coordinates to a Zipcode Tabulation Area
(ZCTA)¹ and use only messages from the continental USA. We use some content filtering to focus
on conversational messages, excluding all retweets (both those identified in the official Twitter API
and messages whose text includes the token "RT"), and also exclude messages that contain a
URL. Grouping tweets by author, we retain only authors who have between 10 and 1000 messages
over this timeframe. Each Twitter message includes a timestamp. We aggregate messages into
seven-day intervals, which facilitates computation and removes any day-of-week effects.

¹ Zipcode Tabulation Areas are defined by the U.S. Census: http://www.census.gov/geo/www/cob/z52000.html.
They closely correspond to postal service zipcodes.
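
The filtering and weekly binning described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(author, timestamp, text)` tuple layout and the function names are hypothetical, the 10 to 1000 message bounds are assumed to be inclusive and applied after content filtering, and retweets flagged by the Twitter API are not handled because that metadata is not modeled here.

```python
from collections import defaultdict
from datetime import datetime


def is_conversational(text):
    """Reject messages with the token "RT" or with a URL."""
    tokens = text.split()
    if "RT" in tokens:
        return False
    return not any(tok.lower().startswith(("http://", "https://")) for tok in tokens)


def week_index(timestamp, start):
    """Zero-based index of the seven-day interval that contains timestamp."""
    return (timestamp - start).days // 7


def filter_and_bin(messages, start, min_msgs=10, max_msgs=1000):
    """Group surviving messages by author as (week, text) pairs.

    messages: iterable of (author, timestamp, text) tuples (assumed layout).
    Authors outside the [min_msgs, max_msgs] range are dropped.
    """
    by_author = defaultdict(list)
    for author, timestamp, text in messages:
        if is_conversational(text):
            by_author[author].append((week_index(timestamp, start), text))
    return {author: msgs for author, msgs in by_author.items()
            if min_msgs <= len(msgs) <= max_msgs}
```

Binning by integer division of the day offset keeps every interval exactly seven days long, which is what removes day-of-week effects.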

The United States Office of Management and Budget defines a set of Metropolitan Statistical Areas
(MSAs), which are not legal administrative divisions, but rather, geographical regions centered
around a single urban core [26]. We consider the 200 most populous MSAs, and place each author
in the MSA whose center is nearest to the author's geographical location (averaged among their 
messages). The most populous MSA is centered on New York City (population 19 million); the 
200th most populous is Burlington, VT (population 208,000). 
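
A minimal sketch of the author-to-MSA assignment follows. The text does not say which distance measure is used, so great-circle (haversine) distance is an assumption here, as is the plain arithmetic mean of latitudes and longitudes. The function names and the coordinates in the usage note are illustrative only.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def mean_location(points):
    """Average an author's message coordinates (adequate at continental-US scale)."""
    lats, lons = zip(*points)
    return sum(lats) / len(lats), sum(lons) / len(lons)


def nearest_msa(points, msa_centers):
    """Return the name of the MSA whose center is closest to the author's mean location.

    msa_centers: {name: (lat, lon)} (assumed layout).
    """
    lat, lon = mean_location(points)
    return min(msa_centers, key=lambda name: haversine_km(lat, lon, *msa_centers[name]))
```

For example, with approximate centers `{"New York": (40.71, -74.01), "Atlanta": (33.75, -84.39)}`, an author whose messages average to about (33.75, -84.25) is placed in Atlanta.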

For each MSA we compute demographic attributes using information from the 2000 U.S. Census.²
We consider the following demographic attributes: % White; % African American; % Hispanic;
% urban; % renters; median log income. Rather than using the aggregate demographics across the
entire MSA, we correct for the fact that Twitter users are not an unbiased sample from the MSA.
We identify the finer-grained Zip Code Tabulation Areas (ZCTA) from which each individual posts
Twitter messages most frequently. The MSA demographics are then computed as the weighted
average of the demographics of the relevant ZCTAs. Of course, Twitter users may not be a representative
sample of their ZCTAs, but these finer-grained units are specifically drawn to be demographically
homogeneous, so on balance this should give a more accurate approximation of the demographics
of the individuals in the dataset. For the first author's home city of Atlanta, the unweighted
MSA statistics are 55% White, 32% Black, 10% Hispanic; our weighted estimate is 53% White,
40% Black, and 6% Hispanic. In the Pittsburgh MSA, the unweighted statistics are 90% White
and 8% Black; our weighted estimate is 84% White and 12% Black. This coheres with our earlier
finding that American Twitter users inhabit zipcodes with higher-than-average African American
populations [3], and matches survey data showing a high rate of adoption for Twitter among African
Americans [27, 28].
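
The reweighting step can be illustrated with a short sketch. It assumes that each ZCTA is weighted by the number of dataset authors whose most frequent posting location falls in it; the text says only "weighted average", so that weighting is an assumption, and the names and numbers below are made up.

```python
from collections import Counter


def weighted_msa_demographics(author_zctas, zcta_stats):
    """Average ZCTA-level attributes, weighting each ZCTA by its author count.

    author_zctas: list with one ZCTA id per author in the MSA (assumed layout).
    zcta_stats:   {zcta_id: {attribute: value}} for every ZCTA that appears.
    """
    counts = Counter(author_zctas)
    total = sum(counts.values())
    attributes = next(iter(zcta_stats.values())).keys()
    return {
        attr: sum(n * zcta_stats[zcta][attr] for zcta, n in counts.items()) / total
        for attr in attributes
    }
```

With three authors in a ZCTA that is 20% White and one author in a ZCTA that is 60% White, the weighted estimate is (3 × 20 + 60) / 4 = 30%, whereas an unweighted mean of the two ZCTAs would give 40%.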

We consider the 10,000 most frequent words (no hashtags or usernames were included), and further
require that each word must be used more than five times in one week within a single metropolitan
area. The second filtering step, which reduces the vocabulary to 1,818 words, prefers words that
attain short-term popularity within a single region, as compared to words that are used infrequently
but uniformly across space and time. Text was tokenized using the twokenize.py script,³ which is
designed to be robust to social media phenomena that confuse other tokenizers, such as emoticons [29].
We normalized repetition of the same character two or more times to just two (e.g. sooooo → soo).
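
The two vocabulary steps above are simple enough to sketch directly. The regular expression and the `{(word, msa, week): count}` layout are assumptions made for illustration; tokenization itself (twokenize.py) is not reproduced.

```python
import re

_REPEAT = re.compile(r"(.)\1+")


def squeeze_repeats(token):
    """Collapse any run of two or more identical characters to exactly two."""
    return _REPEAT.sub(r"\1\1", token)


def select_vocabulary(counts, candidates, min_count=5):
    """Keep candidate words used more than min_count times in some (MSA, week).

    counts: {(word, msa, week): number of uses} (assumed layout).
    """
    bursty = {word for (word, _msa, _week), n in counts.items() if n > min_count}
    return [word for word in candidates if word in bursty]
```

Here `squeeze_repeats("sooooo")` gives `"soo"`, while a token without repeated characters such as `"bruh"` is returned unchanged.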

4 Autoregressive model of language change 

We model language change as a simple dynamical system. For each word $i$, in region $r$ during week
$t$, we count the number of individuals who use $i$ as $c_{i,r,t}$, and the total number of individuals who
have posted any text at all as $s_{r,t}$. Our model assumes that $c_{i,r,t}$ follows a binomial distribution:
$c_{i,r,t} \sim \mathrm{Binomial}(s_{r,t}, \theta_{i,r,t})$, where $\theta_{i,r,t}$ is the expected likelihood of each individual using the
word $i$ during week $t$ in region $r$. We treat $\theta_{i,r,t}$ as the result of applying a logistic transformation
to four pieces of information: the background log frequency of the word, $\nu_i$; a temporal-regional
effect quantifying the general activation of region $r$ at time $t$, $\tau_{r,t}$; a global temporal effect for word
$i$ at time $t$, $\eta_{i,*,t}$; and a regional-temporal activation for word $i$, $\eta_{i,r,t}$. To summarize,
$\theta_{i,r,t} = \sigma(\nu_i + \tau_{r,t} + \eta_{i,*,t} + \eta_{i,r,t})$, where $\sigma(x) = e^x / (1 + e^x)$.

Table 1: Summary of mathematical notation. The index $i$ indicates words, $r$ indicates regions (MSAs), and $t$
indicates time (weeks).

  $c_{i,r,t}$              count of individuals who use word $i$ in region $r$ at time $t$
  $s_{r,t}$                count of individuals who post messages in region $r$ at time $t$
  $\theta_{i,r,t}$         estimated probability of using word $i$ in region $r$ at time $t$
  $\nu_i$                  overall log-frequency of word $i$
  $\tau_{r,t}$             general activation of region $r$ at time $t$
  $\eta_{i,*,t}$           global activation of word $i$ at time $t$
  $\eta_{i,r,t}$           activation of word $i$ at time $t$ in region $r$
  $\eta_{i,t}$             vertical concatenation of each $\eta_{i,r,t}$ and $\eta_{i,*,t}$, a vector of size $R + 1$
  $\sigma(\cdot)$          the logistic function, $\sigma(x) = e^x / (1 + e^x)$
  $A$                      autoregressive coefficients (size $R \times R$)
  $\Gamma$                 variance of the autoregressive process (size $R \times R$)
  $\zeta_{i,r,t}$          parameter of the Taylor approximation to the logistic binomial likelihood
  $m_{i,r,t}$              Gaussian pseudo-emission in the Kalman smoother
  $\Sigma^2_{i,r,t}$       emission variance in the Kalman smoother
  $\omega^{(k)}_{i,r,t}$   weight of particle $k$ in the forward pass of the sequential Monte Carlo algorithm

² This part of the data was prepared before the 2010 Census information was released; we believe it is
sufficient for developing our model and data analysis method, but plan to update it in future work.

³ https://github.com/brendano/tweetmotif

Only the variables $\eta_{i,r,t}$ capture geographically-specific changes in word frequency over time. We
perform statistical inference over these variables, conditioned on the observed counts $c$ and $s$, using
a vector autoregression (VAR) model. The general form can be written as $\eta_t \sim f(\eta_{t-1}, \ldots, \eta_{t-\ell})$,
assuming Markov independence after a fixed lag of $\ell$. The function $f$ can model dependencies
both between pairs of words and across regions, but in this paper we consider a restricted form
of $f$: autoregressive dependencies are assumed to be linear and first-order Markov; a single set
of autoregressive coefficients is shared across all words; cross-word dependencies are otherwise
ignored. This yields the linear model:

$\eta_{i,t} \sim \mathrm{Normal}(A \eta_{i,t-1}, \Gamma)$,   $c_{i,r,t} \sim \mathrm{Binomial}(s_{r,t}, \sigma(\nu_i + \tau_{r,t} + \eta_{i,*,t} + \eta_{i,r,t}))$   (1)

where the region-to-region coefficients $A$ govern lexical diffusion for all words. We rewrite the sum
$\eta_{i,*,t} + \eta_{i,r,t}$ as a vector product $h_r \eta_{i,t}$, where $\eta_{i,t}$ is the vertical concatenation of each $\eta_{i,r,t}$ and
$\eta_{i,*,t}$, and $h_r$ is a row indicator vector that picks out the elements $\eta_{i,r,t}$ and $\eta_{i,*,t}$.
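
To make the generative reading of Equation 1 concrete, the following sketch simulates counts for a single word. Several choices are assumptions made only for illustration: the state starts at zero, $\Gamma$ is taken to be isotropic with standard deviation `gamma_sd`, and the transition matrix is given one row and column per element of the $(R+1)$-dimensional state.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_word(A, gamma_sd, nu, tau, s, rng):
    """Draw counts c[t][r] for one word from the linear-Gaussian / binomial model.

    A:        (R+1) x (R+1) transition matrix as nested lists (assumed shape);
              the last state element is the global activation eta_{i,*,t}.
    gamma_sd: standard deviation of the transition noise (isotropic, assumed).
    nu:       background log-frequency of the word.
    tau:      tau[t][r], regional activation; s: s[t][r], number of active authors.
    """
    T, R = len(s), len(s[0])
    D = R + 1
    eta = [0.0] * D
    counts = []
    for t in range(T):
        mean = [sum(A[m][n] * eta[n] for n in range(D)) for m in range(D)]
        eta = [rng.gauss(mean[m], gamma_sd) for m in range(D)]
        row = []
        for r in range(R):
            # h_r eta picks out the regional and the global activation.
            theta = sigmoid(nu + tau[t][r] + eta[R] + eta[r])
            row.append(sum(rng.random() < theta for _ in range(s[t][r])))
        counts.append(row)
    return counts
```

Off-diagonal entries of `A` are what let a rise in one region's activation raise the expected activation of another region a week later.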

Our ultimate goal is to estimate confidence intervals around the cross-regional autoregression
coefficients $A$, which are computed as a function of the regional-temporal word activations $\eta_{i,r,t}$. We
take a Monte Carlo approach, computing samples for the trajectories $\eta_{i,t}$, and then computing point
estimates of $A$ for each sample, aggregating over all words $i$. Bayesian confidence intervals can then
be computed from these point estimates, regardless of the form of the estimator used to compute $A$.
We now discuss these steps in more detail.

4.1 Sequential Monte Carlo estimation of word activations 

To obtain smoothed estimates of $\eta$, we apply a sequential Monte Carlo (SMC) smoothing algorithm
known as Forward Filtering Backward Sampling (FFBS) [30]. The algorithm appends a backward
pass to any SMC filter that produces a set of particles and weights $\{\eta^{(k)}_{i,r,t}, \omega^{(k)}_{i,r,t}\}_{1 \le k \le K}$. Our forward
pass is a standard bootstrap filter [31]: by setting the proposal distribution $q(\eta_{i,r,t} \mid \eta_{i,r,t-1})$ equal
to the transition distribution $P(\eta_{i,r,t} \mid \eta_{i,r,t-1}; A, \Gamma)$, the forward particle weights are equal to the
recursive product of the emission likelihoods,

$\omega^{(k)}_{i,r,t} = \omega^{(k)}_{i,r,t-1} \, \mathrm{Binomial}(c_{i,r,t}; s_{r,t}, \sigma(\nu_i + \tau_{r,t} + h_r \eta^{(k)}_{i,t}))$.   (2)

We experimented with more complex SMC algorithms, including resampling, annealing, and more 
accurate proposal distributions, but none consistently achieved higher likelihood than the straight- 
forward bootstrap filter. 

FFBS converts the filtered estimates $P(\eta_{i,r,t} \mid c_{i,r,1:t}, s_{r,1:t})$ to a smoothed estimate
$P(\eta_{i,r,t} \mid c_{i,r,1:T}, s_{r,1:T})$ by resampling the forward particles in a backward pass. In this pass,
at each time $t$, we select particle $\eta^{(k)}_{i,r,t}$ with probability proportional to $\omega^{(k)}_{i,r,t} P(\eta_{i,r,t+1} \mid \eta^{(k)}_{i,r,t})$, which
is the filtering weight multiplied by the transition probability. When we reach $t = 1$, we have
obtained an unweighted draw from the distribution $P(\eta_{i,r,1:T} \mid c_{i,r,1:T}, s_{r,1:T}; A, \Gamma, \nu, \tau)$. We can
use these draws to estimate the distribution of any arbitrary function of $\eta_i$.

Figure 2: Left: Monte Carlo and Kalman smoothed estimates of $\eta$ for the word ctfu in five cities. Right:
estimates of term frequency in each city. Circles show empirical frequencies.
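
The forward and backward passes of Section 4.1 can be sketched for a single word in a single region. This is a simplified illustration under stated assumptions: the latent state is a scalar AR(1) process with coefficient `a` and noise standard deviation `sd`, the global activation is folded into a known `offset[t]` (standing in for $\nu_i + \tau_{r,t}$), the initial state is zero, and, as in the bootstrap filter described above, no resampling is done in the forward pass. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_binomial(c, s, theta):
    """Log of Binomial(c; s, theta), with theta clamped away from 0 and 1."""
    theta = min(max(theta, 1e-12), 1.0 - 1e-12)
    return (math.lgamma(s + 1) - math.lgamma(c + 1) - math.lgamma(s - c + 1)
            + c * math.log(theta) + (s - c) * math.log1p(-theta))


def log_normal(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def draw_index(log_weights, rng):
    """Sample an index with probability proportional to exp(log_weights)."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    u = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if u <= acc:
            return k
    return len(weights) - 1


def forward_filter(c, s, offset, a, sd, n_particles, rng):
    """Bootstrap filter: propose from the transition, weight by the emission (Eq. 2)."""
    particles, log_w = [], []
    for t in range(len(c)):
        row, row_w = [], []
        for k in range(n_particles):
            prev = particles[t - 1][k] if t > 0 else 0.0
            prev_w = log_w[t - 1][k] if t > 0 else 0.0
            eta = rng.gauss(a * prev, sd)
            row.append(eta)
            row_w.append(prev_w + log_binomial(c[t], s[t], sigmoid(offset[t] + eta)))
        particles.append(row)
        log_w.append(row_w)
    return particles, log_w


def backward_sample(particles, log_w, a, sd, rng):
    """Draw one smoothed trajectory by reselecting forward particles from the last week back."""
    T = len(particles)
    path = [0.0] * T
    path[-1] = particles[-1][draw_index(log_w[-1], rng)]
    for t in range(T - 2, -1, -1):
        # Filtering weight times transition probability to the already-chosen next state.
        scores = [lw + log_normal(path[t + 1], a * p, sd)
                  for p, lw in zip(particles[t], log_w[t])]
        path[t] = particles[t][draw_index(scores, rng)]
    return path
```

Weights are kept in log space because the recursive product in Equation 2 underflows quickly over an 86-week series. Calling `backward_sample` repeatedly on the same filter output yields the independent trajectory draws from which functions of $\eta$ can be summarized.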

4.2 Estimation of system dynamics parameters 

The parameter $\Gamma$ controls the variance of the latent state, and the diagonal elements of $A$ control
the amount of autocorrelation within each region. The maximum likelihood settings of these
parameters are closely tied to the latent state $\eta$. For this reason, we estimate these parameters using
an expectation-maximization algorithm, iterating between maximum-likelihood estimation of $\Gamma$ and
$\mathrm{diag}(A)$ and computation of a variational bound on $P(\eta_i \mid c_i, s; \Gamma, A)$. We run the sequential Monte
Carlo procedure described above only after converging to a local optimum for $\Gamma$ and $\mathrm{diag}(A)$. This
combination of variational and Monte Carlo inference offers advantages of both: variational
inference gives speed and guaranteed convergence for parameter estimation, while Monte Carlo inference
allows the computation of confidence intervals around arbitrary functions of the latent variables $\eta$.
Similar hybrid approaches have been proposed in the prior literature [32].

The variational bound on $P(\eta_i \mid c_i, s; \Gamma, A)$ is obtained from a Gaussian approximation to the
Binomial emission distribution. This enables the computation of the joint density by a Kalman smoother,
a two-pass procedure for computing smoothed estimates of a latent state variable given Gaussian
emission and transition distributions. For each $\eta_{i,r,t}$, we take a second-order Taylor approximation
to the emission distribution at the point $\zeta_{i,r,t}$. This yields a Gaussian approximation,

$\mathrm{Binomial}(c_{i,r,t} \mid s_{r,t}, \sigma(\nu_i + \tau_{r,t} + \eta_{i,*,t} + \eta_{i,r,t})) \approx \mathrm{Normal}(m_{i,r,t} \mid \eta_{i,r,t}, \Sigma^2_{i,r,t})$,   (3)

where the parameters $m$ and $\Sigma^2$ depend on a Taylor approximation parameter $\zeta_{i,r,t}$,

$\Sigma^2_{i,r,t} = \left( s_{r,t} \, \sigma(\zeta_{i,r,t}) (1 - \sigma(\zeta_{i,r,t})) \right)^{-1}$   (4)
$m_{i,r,t} = \Sigma^2_{i,r,t} \left( c_{i,r,t} - s_{r,t} \, \sigma(\zeta_{i,r,t}) \right) + \zeta_{i,r,t} - \tau_{r,t} - \nu_i$   (5)
$\zeta_{i,r,t} = \eta_{i,r,t} + \eta_{i,*,t} + \tau_{r,t} + \nu_i$   (6)

Intuitively, the emission parameter $m$ depends on the gap between the observed counts $c$ and the
expected counts $s \sigma(\zeta)$. We initialize $\zeta_{i,r,t}$ to the relative frequency estimate $\sigma^{-1}(c_{i,r,t} / s_{r,t})$, and then
iteratively update it to improve the quality of the approximation.
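
Equations 4 and 5 translate directly into a few lines of code. The sketch below computes the Gaussian pseudo-observation for one (word, region, week) cell; the function names are hypothetical, and the initialization shown requires $0 < c < s$ (the text speaks of smoothed relative frequencies, but the smoothing itself is not specified and is not reproduced here).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def pseudo_observation(c, s, zeta, nu, tau):
    """Return (m, var): the Gaussian pseudo-emission and its variance at expansion point zeta."""
    p = sigmoid(zeta)
    var = 1.0 / (s * p * (1.0 - p))            # Equation 4
    m = var * (c - s * p) + zeta - tau - nu    # Equation 5
    return m, var
```

At the initial expansion point `zeta = logit(c / s)`, the expected count `s * sigmoid(zeta)` equals `c`, so the first term of Equation 5 vanishes and `m` reduces to `zeta - tau - nu`. Cells with few active authors get a large variance, which is how low-count cities end up with wider intervals.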

During initialization, the parameters $\nu_i$ (overall word log-frequency) and $\tau_{r,t}$ (global
activation in region $r$ at time $t$) are fixed to their maximum-likelihood point estimates, assuming
$\eta_{i,r,t} = \eta_{i,*,t} = 0$. The global word popularity $\eta_{i,*,t}$ is a latent variable; we perform
inference by including it in the latent state of the Kalman smoother. The final state equations are:

$\eta_{i,t} \sim \mathrm{Normal}(A \eta_{i,t-1}, \Gamma)$,   $m_{i,t} \sim \mathrm{Normal}(H \eta_{i,t}, \Sigma^2_{i,t})$,   (7)

where the matrix $A$ is diagonal and the matrix $H$ is a vertical concatenation of all row vectors $h_r$
(equivalently, it is a horizontal concatenation of the identity matrix with a column vector of 1s). As
we now have Gaussian equations for both the emissions and transitions, we can apply a standard
Kalman smoother (we use an optimized version of the Bayes Net Toolbox for Matlab [33]). The
EM algorithm arrives at local optima of the data likelihood for the transition variance $\Gamma$ and the
auto-covariance on the diagonal of $A$.

Figure 2 shows the estimates of $\eta$ for the word ctfu (cracking the fuck up) in five metropolitan areas.
The strong dotted lines show the 95% confidence intervals of the Kalman smoother; each of 100
FFBS samples is shown as a light dotted line. The right panel shows the estimated term frequencies
with lines, and the empirical frequencies with circles. The term was used only in Cleveland (among
these cities) at the beginning of the dataset, but eventually became popular in Philadelphia,
Pittsburgh, and Erie (Pennsylvania). The wider confidence intervals for Erie reflect greater uncertainty
due to the smaller overall counts for this city.

Figure 3: $A_{m,n}$ estimates for all $(m, n)$, showing intervals $\mu_{m,n} \pm 3.12 \, \sigma_{m,n}$. Left: Both self-effects and
cross-region effects are shown (excluding two having $z > 500$). Right: only cross-region effects.



4.3 Estimating cross-region effects 

The maximum-likelihood estimate for the system coefficients is ordinary least squares,

$A = (\eta_{1:T-1}^{\top} \eta_{1:T-1})^{-1} \eta_{2:T}^{\top} \eta_{1:T-1}$   (8)



The regression includes a bias term. We experimented with ridge regression for regularization 
(penalizing the squared Frobenius norm of A), comparing regularization coefficients by cross- 
validation within the time series. We found that regularization can improve the likelihood on held- 
out data when computing A for individual words, but when words are aggregated, regularization 
yields no improvement. 
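
As an illustration of this estimation step, the sketch below fits the autoregressive coefficients by least squares with an optional ridge penalty, using NumPy. It makes its own convention explicit: `eta` holds one row per week, the fit is `eta[t] ≈ A @ eta[t-1] + b`, and the bias `b` is left unpenalized. These are assumptions for the example, not a statement of how the authors arranged their matrices.

```python
import numpy as np


def estimate_A(eta, ridge=0.0):
    """Least-squares (optionally ridge) fit of eta[t] ≈ A @ eta[t-1] + b.

    eta: array of shape (T, R), one row per week.
    Returns (A, b) with A of shape (R, R) and b of shape (R,).
    """
    # Lagged states plus a column of ones for the bias term.
    X = np.hstack([eta[:-1], np.ones((eta.shape[0] - 1, 1))])
    Y = eta[1:]
    penalty = ridge * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the bias
    B = np.linalg.solve(X.T @ X + penalty, X.T @ Y)
    return B[:-1].T, B[-1]
```

With `ridge=0.0` this is ordinary least squares; a positive value shrinks the entries of `A` toward zero, which corresponds to penalizing its squared Frobenius norm.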

Next, we compute confidence estimates for $A$. We compute $A^{(k)}$ for each sampled sequence $\eta^{(k)}$,
summing over all words. We can then compute Monte Carlo-based confidence intervals around each
entry $A_{m,n}$, by fitting a Gaussian to the samples: $\mu_{m,n} = \frac{1}{K} \sum_k A^{(k)}_{m,n}$ and
$\sigma^2_{m,n} = \frac{1}{K} \sum_k (A^{(k)}_{m,n} - \mu_{m,n})^2$. To identify coefficients that are very likely to be non-zero, we compute Wald tests using
z-scores $z_{m,n} = \mu_{m,n} / \sigma_{m,n}$. We apply the Benjamini-Hochberg False Discovery Rate correction
for multiple hypothesis tests [34], computing a one-tailed z-score threshold $z$ such that the total
proportion of false positives will be less than .01. This test finds a $z$ such that

$\mathrm{FDR} = 0.01 = \frac{E[\#\{\text{that pass under null hypothesis}\}]}{\#\{\text{that pass empirically}\}} = \frac{200(199)(1 - \Phi(z))}{\#\{z_{m,n} > z\}}$   (9)

Coefficients that pass this threshold can be confidently considered to indicate cross-regional
influence in the autoregressive model.
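
The z-scores and the threshold in Equation 9 can be sketched with the standard library alone. The threshold search below is a plain grid search over the ratio in Equation 9, which is an assumption about implementation rather than the standard Benjamini-Hochberg step-up procedure; the function names and the grid spacing are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def z_scores(samples):
    """samples: {(m, n): [A_mn from each Monte Carlo draw]} -> {(m, n): mu / sigma}."""
    return {pair: mean(values) / pstdev(values) for pair, values in samples.items()}


def fdr_threshold(z, n_tests, fdr=0.01, grid=None):
    """Smallest grid threshold whose estimated false discovery rate is at most fdr.

    z:       iterable of z-scores for the tested coefficients.
    n_tests: number of hypotheses (200 * 199 for the cross-region coefficients).
    Returns None if no threshold on the grid qualifies.
    """
    z = list(z)
    cdf = NormalDist().cdf
    grid = grid or [i / 100 for i in range(0, 1001)]
    for threshold in grid:
        passed = sum(1 for value in z if value > threshold)
        if passed == 0:
            return None
        expected_false = n_tests * (1.0 - cdf(threshold))
        if expected_false / passed <= fdr:
            return threshold
    return None
```

`pstdev` divides by $K$, matching the Gaussian fit written above. The numerator of the ratio is the number of coefficients expected to exceed the threshold if every true coefficient were zero.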



5 Analysis 

We apply this method to the Twitter data described in Section 3. In practice, we initialize each $\eta_i$
to smoothed relative-frequency estimates, and run the EM-Kalman procedure for 100 steps or until
convergence. We then draw 100 samples of $\eta_i$ from the FFBS smoother. These samples are used
to compute confidence intervals around the autoregression coefficients $A$. When aggregating across
all words, the total number of significant $(m, n)$ interactions is 3544, out of 39,800 possible, with a
z-score threshold of 3.12. Figure 3 visually depicts these estimates; the FDR of 0.01 indicates that
the blue points occur 100 times more often than they would by chance.

Figure 4 shows high-confidence, high-influence links among the 50 largest metropolitan statistical
areas in the United States. For more precise inspection, Figure 5 shows, for each city, all
high-confidence influence links to and from all other cities (also called ego networks). The role of
geographical proximity is apparent: there are dense connections within regions such as the northeast,
midwest, and west coast, and relatively few cross-country connections.

Figure 4: Lexical influence network: high-confidence, high-influence links ($z > 3.12$, $\mu > 0.025$). Left:
among all 50 largest MSAs. Right: subnetwork for the Eastern seaboard area. See also Figure 5, which uses the
same colors for cities.

Table 2: Geographical distances and absolute demographic differences for linked and non-linked pairs of
MSAs. Confidence intervals are p < .01, two-tailed.

            geo distance   % White      % Af. Am.     % Hispanic    % urban       % renter      log income
  linked    10.5 ± 0.5     10.2 ± 0.4   8.45 ± 0.37   9.88 ± 0.64   9.55 ± 0.34   6.30 ± 0.24   0.181 ± 0.006
  unlinked  20.8 ± 0.6     16.2 ± 0.5   16.3 ± 0.5    15.2 ± 0.8    12.0 ± 0.4    6.78 ± 0.25   0.201 ± 0.007

By analyzing the properties of pairs of cities with significant influence, we can identify the
geographic and demographic drivers of linguistic influence. First, we consider the ways in which
similarity and proximity between regions cause them to share linguistic influence. Then we consider
the asymmetric properties that cause some regions to be linguistically influential.

5.1 Symmetric properties of linguistic influence 

To assess symmetric properties of linguistic influence, we compare pairs of MSAs linked by non- 
zero autoregressive coefficients, versus randomly-selected pairs of MSAs. Specifically, we compute 
the empirical distribution over senders and receivers (how often each city fills each role), and we 
sample pairs from these distributions. The baseline thus includes the same distribution of MSAs as 
the experimental condition, but randomizes the associations. Even if our model were predisposed to 
detect influence among certain types of MSAs (for example, large or dense cities), that would not 
bias this analysis, since the aggregate makeup of the linked and non-linked pairs is identical. 
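
The baseline construction can be sketched as follows. Senders and receivers are drawn independently from their empirical distributions among the linked pairs, so each city fills each role about as often as in the linked set. Rejecting self-pairs and pairs that are already linked is an assumption of this sketch, and the function name is hypothetical.

```python
import random


def sample_baseline_pairs(linked_pairs, n, rng):
    """Draw n (sender, receiver) pairs from the marginal role distributions.

    linked_pairs: list of (sender, receiver) tuples with significant coefficients.
    Assumes at least one non-linked, non-self combination exists; otherwise the
    loop would not terminate.
    """
    senders = [sender for sender, _ in linked_pairs]
    receivers = [receiver for _, receiver in linked_pairs]
    linked = set(linked_pairs)
    baseline = []
    while len(baseline) < n:
        pair = (rng.choice(senders), rng.choice(receivers))
        if pair[0] != pair[1] and pair not in linked:
            baseline.append(pair)
    return baseline
```

Because `senders` and `receivers` keep one entry per linked pair, a city that sends often is proportionally more likely to be drawn as a baseline sender.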

Table 2 shows the similarities between pairs of cities that are linguistically linked (line 1) or selected
randomly (line 2). Cities indicated as linked by our model are more geographically proximal than
randomly-selected cities, and are more demographically similar on every measured dimension. All
effects are significant at p < 0.05; the percentage of renters just misses the threshold for p < 0.01.

Because demographic attributes correlate with each other and with geography, it is possible that
some of these homophily effects are spurious. To disentangle these factors, we perform a multiple
regression. We choose a classification framework, treating each linked city pair as a positive
example, and randomly selected non-linked pairs as negative examples. Logistic regression is used to
assign weights to each of several independent variables: product of populations [10], geographical
proximity, and the absolute difference of each demographic feature. All features are standardized.

[Figure 5 panels; surviving panel titles: Los Angeles, CA; Philadelphia, PA]

Figure 5: Ego networks for each of the 50 largest MSAs. Unlike Figure 4, all incoming and outgoing links are
included (having z-score > 3.12). A blue link between cities indicates there is both an incoming and outgoing
edge; green indicates outgoing-only; orange indicates incoming-only. Maps are ordered by population size.
Official MSA names include up to three city names to describe the area; we truncate to the first.

Table 3: Logistic regression to predict linked pairs of MSAs.

(a) Logistic regression coefficients predicting influence links between MSAs. Bold typeface indicates
statistical significance at p < .01.

                           estimate   s.e.     t-value
  intercept                -0.0601    0.0287   -2.10
  product of populations   0.393      0.048    8.22
  distance                 -0.870     0.033    -26.1
  abs. diff. % White       -0.214     0.040    -5.39
  abs. diff. % Af. Am.     -0.693     0.042    -16.7
  abs. diff. % Hispanic    -0.140     0.030    -4.63
  abs. diff. % urban       -0.170     0.030    -5.76
  abs. diff. % renters     -0.0314    0.0304   -1.04
  abs. diff. log income    0.0458     0.0301   1.52

(b) Accuracy of predicting influence links, with ablated feature sets.

  feature set      accuracy   gap
  all features     72.3
  -population      71.6       0.7
  -geography       67.6       4.7
  -demographics    66.9       5.4

Table 4: Differences in demographic attributes between senders and receivers of lexical influence. Bold
typeface indicates statistical significance at p < .01.

               Log pop.   % White   % Af. Am   % Hispanic   % Urban   % Renters   Log income
  difference   0.968      -0.0858   0.0703     0.0094       0.0612    0.0231      0.0546
  s.e.         0.0543     0.0065    0.0063     0.0098       0.0054    0.0041      0.0113
  z-score      17.8       -13.2     11.1       0.950        11.3      5.67        4.82

The resulting coefficients and confidence intervals are shown in Table 3(a). Product of
populations, geographical proximity, and similar proportions of African Americans are the most clearly
important predictors. Even after accounting for geography and population size, language change
is significantly more likely to be propagated between regions that are demographically similar,
particularly with respect to race and ethnicity.

Finally, we consider the impact of removing features on the accuracy of classification for whether a
pair of MSAs are linguistically linked by our model. Table 3(b) shows the classification accuracy,
computed over five-fold cross-validation. Removing the population feature impairs accuracy to a
small extent; removing either of the other two feature sets makes accuracy noticeably worse.



5.2 Asymmetric properties of linguistic influence 

Next we evaluate asymmetric properties of linguistic influence. The goal is to identify the features 
that make metropolitan areas more likely to send rather than receive lexical influence. Places that 
possess many of these characteristics are likely to have originated or at least popularized many of 
the neologisms observed in social media text. We consider the characteristics of the 466 pairs of 
MSAs in which influence is detected in only one direction. 

Table 4 shows the average difference in population and demographic attributes between the sender
and receiver cities in each of the 466 asymmetric pairs. Senders have significantly higher
populations, more African Americans, fewer Whites, more renters, greater income, and are more urban
(p < 0.01 in all cases). Because demographic attributes and population levels are correlated, we
again turn to logistic regression to try to identify the unique contributing factors. In this case, the
classification problem is to identify the sender in each asymmetric pair. As before, all features are
standardized. The feature weights from training on the full dataset are shown in Table 5. Only the
coefficients for population size and percentage of African Americans are statistically significant.
In 5-fold cross-validation, this classifier achieves 82% accuracy (population alone achieves 78%
accuracy; without the population feature, accuracy is 77%).



Table 5: Regression coefficients for predicting direction of influence. Bold typeface indicates statistical
significance at p < .01.

            Log pop.   % White   % Af. Am   % Hispanic   % Urban   % Renters   Log income
  weights   2.2        -0.246    1.08       0.0914       -0.129    -0.0133     0.225
  s.e.      0.290      0.315     0.343      0.229        0.221     0.180       0.194
  t-score   7.68       -0.78     3.15       0.40         -0.557    -0.0736     1.16

6 Conclusions



This paper presents an aggregate analysis of the changing frequencies of thousands of words in social 
media text, over the course of roughly one and a half years. From these changing word counts, we are 
able to reconstruct a network of linguistic influence; by cross-referencing this network with census 
data, we identify the geographic and demographic factors that govern the shape of this network. 

We find strong roles for both geography and demographics, but the role of demographics is espe- 
cially remarkable given that our analysis is centered on metropolitan statistical areas. MSAs are 
by definition geographically homogeneous, but they are heterogeneous along every demographic 
attribute. Nonetheless, demographically-similar cities are significantly more likely to share linguis- 
tic influence. Among demographic attributes, racial homophily seems to play the strongest role, 
although we caution that demographic properties such as socioeconomic class are more difficult to 
assess from census statistics. 

At the level of individual language users, demographics may play a stronger role than geography,
as has been asserted in the case of African American English [14]. Further research is necessary to
assess how the geographical diffusion of lexical innovation is modulated by demographic variation
within each geographical area. A second direction for future research is to relax the assumption
that word activations evolve independently. Many innovative words reflect abstract orthographic
patterns: for example, the spelling bruh employs a transcription of a particular pronunciation for
bro, and this transcription pattern may be applied to other words that have similar pronunciation.
A latent variable model could identify sets of words with similar shape features and spatiotemporal
characteristics, while jointly identifying a set of influence matrices.